Learning compact hashing codes with complex objectives from multiple sources for large scale similarity search
Similarity search is a key problem in many real-world applications, including image and text retrieval, content reuse detection, and collaborative filtering. The purpose of similarity search is to identify data examples similar to a given query example. The explosive growth of the Internet has generated a huge amount of data such as texts, images, and videos, making efficient large-scale similarity search increasingly important. Hashing methods have become popular for large-scale similarity search due to their computational and memory efficiency. These methods design compact binary codes to represent data examples so that similar examples are mapped to similar codes. This dissertation addresses five major problems in utilizing supervised information from multiple sources in hashing with respect to different objectives. Firstly, we address the problem of incorporating semantic tags by modeling the latent correlations between tags and data examples. More precisely, the hashing codes are learned in a unified semi-supervised framework by simultaneously preserving the similarities between data examples and ensuring tag consistency via a latent factor model. Secondly, we solve the missing data problem by latent subspace learning from multiple sources. The hashing codes are learned by enforcing data consistency among different sources. Thirdly, we address the problem of hashing on structured data by graph learning. A weighted graph is constructed from the structured knowledge in the data, and the hashing codes are then learned by preserving the graph similarities. Fourthly, we address the problem of learning hashing codes with high ranking quality by utilizing relevance judgments from users. The hashing code/function is learned by optimizing a commonly used non-smooth, non-convex ranking measure, NDCG. Finally, we deal with the problem of insufficient supervision by active learning.
We propose to actively select the most informative data examples and tags jointly, based on the selection criterion that both the data examples and the tags should be the most uncertain and dissimilar to each other. Extensive experiments on several large-scale datasets demonstrate the superior performance of the proposed approaches over several state-of-the-art hashing methods from different perspectives.
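To make the idea of compact binary codes concrete, the sketch below maps vectors to bit strings and compares them by Hamming distance. It uses random hyperplane projections, a standard data-independent baseline, and not the learned, supervised hashing functions the dissertation proposes. The names `make_hyperplanes`, `hash_code`, and `hamming` are hypothetical and chosen for illustration only.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes in a dim-dimensional space.

    Assumption: random projections stand in for a learned hash function.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def hash_code(vec, planes):
    """Pack one bit per hyperplane: 1 if vec lies on its non-negative side."""
    code = 0
    for plane in planes:
        dot = sum(p * x for p, x in zip(plane, vec))
        code = (code << 1) | (1 if dot >= 0 else 0)
    return code


def hamming(a, b):
    """Number of differing bits between two integer-packed codes."""
    return bin(a ^ b).count("1")
```

Because each bit depends only on the sign of a projection, vectors pointing in similar directions tend to receive codes with a small Hamming distance, which is what makes lookups cheap in both time and memory.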
Impacts of florfenicol on the microbiota landscape and resistome as revealed by metagenomic analysis.
BACKGROUND: Drug-resistant fish pathogens can cause significant economic loss to fish farmers. Since 2012, florfenicol has been an approved drug for treating both septicemia and columnaris diseases in freshwater fish. Because the drug options available for aquaculture are limited, the impact of therapeutic florfenicol treatment on the microbiota landscape and on the resistome present in the aquaculture farm environment needs to be evaluated. RESULTS: Time-series metagenomic analyses were conducted on the aquatic microbiota present in tank-based catfish production systems, in which catfish received the standard therapeutic 10-day florfenicol treatment following federal veterinary regulations. Results showed that the florfenicol treatment shifted the structure of the microbiota and reduced its biodiversity by acting as a strong stressor. Planctomycetes, Chloroflexi, and 13 other phyla were susceptible to the florfenicol treatment, and their abundance was inhibited by it. In contrast, the abundance of several bacteria belonging to the Proteobacteria, Bacteroidetes, Actinobacteria, and Verrucomicrobia phyla increased. These bacteria with increased abundance either harbored florfenicol-resistant genes (FRGs) or had beneficial mutations. The florfenicol treatment promoted the proliferation of florfenicol-resistant genes. The copy numbers of phenicol-specific resistance genes as well as of multiple classes of antibiotic-resistant genes (ARGs) exhibited strong correlations across different genetic exchange communities (p < 0.05), indicating horizontal transfer of florfenicol-resistant genes among these bacterial species or genera. Florfenicol treatment also induced mutation-driven resistance. Significant changes in single-nucleotide polymorphism (SNP) allele frequencies were observed in membrane transporters, in genes involved in recombination, and in genes whose primary function confers a resistance phenotype.
CONCLUSIONS: The therapeutic level of florfenicol treatment significantly altered the microbiome and resistome present in catfish tanks. Both intra-population and inter-population horizontal ARG transfer were observed, with intra-population transfer being more common. The oxazolidinone/phenicol-resistant gene optrA was the most prevalent transferred ARG. In addition to horizontal gene transfer, bacteria could also acquire florfenicol resistance by regulating their innate efflux systems via mutations. The observations made in this study are of great importance for guiding the strategic use of florfenicol, thus preventing the formation, persistence, and spread of florfenicol-resistant bacteria and resistance genes in aquaculture.
Autoregressive Entity Generation for End-to-End Task-Oriented Dialog
Task-oriented dialog (TOD) systems often require interaction with an external
knowledge base to retrieve necessary entity (e.g., restaurant) information to
support the response generation. Most current end-to-end TOD systems either
retrieve the KB information explicitly or embed it into model parameters for
implicit access. While the former approach demands scanning the KB at each turn
of response generation, which is inefficient when the KB scales up, the latter
approach shows higher flexibility and efficiency. In either approach, the
systems may generate a response with conflicting entity information. To address
this issue, we propose to generate the entity autoregressively first and
leverage it to guide the response generation in an end-to-end system. To ensure
entity consistency, we impose a trie constraint on entity generation. We also
introduce a logit concatenation strategy to facilitate gradient backpropagation
for end-to-end training. Experiments on MultiWOZ 2.1 single and CAMREST show
that our system can generate higher-quality and more entity-consistent responses.
Comment: Accepted to COLING 202
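The trie constraint mentioned in the abstract above can be illustrated with a minimal sketch: at each decoding step, only tokens that extend the current prefix toward a known entity are allowed. This is an assumption-laden toy, not the paper's implementation. The helper names, the `"<eos>"` end marker, and the greedy `score_fn` interface are all hypothetical.

```python
def build_trie(entities):
    """Build a nested-dict trie from entity token sequences.

    Assumption: "<eos>" marks the end of a complete entity.
    """
    root = {}
    for tokens in entities:
        node = root
        for tok in tokens:
            node = node.setdefault(tok, {})
        node["<eos>"] = {}
    return root


def allowed_next(trie, prefix):
    """Return the set of tokens that may follow prefix under the trie."""
    node = trie
    for tok in prefix:
        node = node.get(tok)
        if node is None:
            return set()  # prefix is not part of any known entity
    return set(node)


def constrained_greedy(trie, score_fn, max_len=10):
    """Greedy decoding restricted to trie-permitted tokens.

    score_fn(prefix, token) is a hypothetical stand-in for model logits.
    """
    prefix = []
    for _ in range(max_len):
        options = allowed_next(trie, prefix)
        if not options:
            break
        best = max(sorted(options), key=lambda t: score_fn(prefix, t))
        if best == "<eos>":
            break
        prefix.append(best)
    return prefix
```

Restricting the candidate set this way means the decoder can only emit token sequences that spell out an entity present in the trie, whatever scores the model assigns to other tokens.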
Disentangled Phonetic Representation for Chinese Spelling Correction
Chinese Spelling Correction (CSC) aims to detect and correct erroneous
characters in Chinese texts. Although efforts have been made to introduce
phonetic information (Hanyu Pinyin) in this task, they typically merge phonetic
representations with character representations, which tends to weaken the
representation effect of normal texts. In this work, we propose to disentangle
the two types of features to allow for direct interaction between textual and
phonetic information. To learn useful phonetic representations, we introduce a
pinyin-to-character objective to ask the model to predict the correct
characters based solely on phonetic information, where a separation mask is
imposed to disable attention from phonetic input to text. To avoid overfitting
the phonetics, we further design a self-distillation module to ensure that
semantic information plays a major role in the prediction. Extensive
experiments on three CSC benchmarks demonstrate the superiority of our method
in using phonetic information.
Comment: Accepted to ACL 2023 Main Conference
Enhancing Low-Precision Sampling via Stochastic Gradient Hamiltonian Monte Carlo
Low-precision training has emerged as a promising low-cost technique to
enhance the training efficiency of deep neural networks without sacrificing
much accuracy. Its Bayesian counterpart can further provide uncertainty
quantification and improved generalization accuracy. This paper investigates
low-precision sampling via Stochastic Gradient Hamiltonian Monte Carlo (SGHMC)
with low-precision and full-precision gradient accumulators for both strongly
log-concave and non-log-concave distributions. Theoretically, our results show
that, to achieve $\epsilon$-error in the 2-Wasserstein distance for
non-log-concave distributions, low-precision SGHMC achieves quadratic
improvement compared to the state-of-the-art low-precision sampler,
Stochastic Gradient Langevin Dynamics (SGLD).
Moreover, we prove that low-precision SGHMC is more robust to the quantization
error compared to low-precision SGLD due to the robustness of the
momentum-based update w.r.t. gradient noise. Empirically, we conduct
experiments on synthetic data and the MNIST, CIFAR-10, and CIFAR-100 datasets,
which validate our theoretical findings. Our study highlights the potential of
low-precision SGHMC as an efficient and accurate sampling method for
large-scale and resource-limited machine learning.
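As a rough illustration of what low-precision SGHMC involves, the sketch below combines a common SGHMC discretization (friction plus injected Gaussian noise on the momentum) with stochastic rounding of the position onto a fixed grid. It is a one-dimensional toy under stated assumptions, not the paper's algorithm: the momentum is kept in full precision, the exact gradient is used in place of a stochastic one, and the function names, the grid spacing `delta`, and the default hyperparameters are all hypothetical.

```python
import math
import random


def stochastic_round(x, delta, rng):
    """Round x to a multiple of delta, rounding up with probability equal
    to the fractional part, so the result is unbiased in expectation."""
    scaled = x / delta
    low = math.floor(scaled)
    round_up = rng.random() < (scaled - low)
    return delta * (low + round_up)


def low_precision_sghmc(grad, x0, steps, step_size=0.05, friction=1.0,
                        delta=1.0 / 64, seed=0):
    """Toy 1-D SGHMC with the position stored on a low-precision grid.

    Assumptions: identity mass, full-precision momentum, exact gradient.
    """
    rng = random.Random(seed)
    x = stochastic_round(x0, delta, rng)
    v = 0.0
    noise_scale = math.sqrt(2.0 * friction * step_size)
    samples = []
    for _ in range(steps):
        # Momentum update: friction decay, gradient force, injected noise.
        v = ((1.0 - friction * step_size) * v
             - step_size * grad(x)
             + noise_scale * rng.gauss(0.0, 1.0))
        # Position update followed by quantization to the grid.
        x = stochastic_round(x + step_size * v, delta, rng)
        samples.append(x)
    return samples
```

For example, `low_precision_sghmc(lambda x: x, 0.0, 20000)` targets a standard Gaussian, since the gradient of the potential x²/2 is x. Every returned sample is a multiple of `delta`.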
Federated Generalization via Information-Theoretic Distribution Diversification
Federated Learning (FL) has surged in prominence due to its capability of
collaborative model training without direct data sharing. However, the vast
disparity in local data distributions among clients, often termed the
non-Independent Identically Distributed (non-IID) challenge, poses a
significant hurdle to FL's generalization efficacy. The scenario becomes even
more complex when not all clients participate in the training process, a common
occurrence due to unstable network connections or limited computational
capacities. This can greatly complicate the assessment of the trained models'
generalization abilities. While a plethora of recent studies has centered on
the generalization gap pertaining to unseen data from participating clients
with diverse distributions, the divergence between the training distributions
of participating clients and the testing distributions of non-participating
ones has been largely overlooked. In response, our paper unveils an
information-theoretic generalization framework for FL. Specifically, it
quantifies generalization errors by evaluating the information entropy of local
distributions and discerning discrepancies across these distributions. Inspired
by our deduced generalization bounds, we introduce a weighted aggregation
approach and a duo of client selection strategies. These innovations aim to
bolster FL's generalization prowess by encompassing a more varied set of client
data distributions. Our extensive empirical evaluations reaffirm the potency of
our proposed methods, aligning seamlessly with our theoretical construct.
Attack Prompt Generation for Red Teaming and Defending Large Language Models
Large language models (LLMs) are susceptible to red teaming attacks, which
can induce LLMs to generate harmful content. Previous research constructs
attack prompts via manual or automatic methods, which have their own
limitations on construction cost and quality. To address these issues, we
propose an integrated approach that combines manual and automatic methods to
economically generate high-quality attack prompts. Specifically, considering
the impressive capabilities of newly emerged LLMs, we propose an attack
framework to instruct LLMs to mimic human-generated prompts through in-context
learning. Furthermore, we propose a defense framework that fine-tunes victim
LLMs through iterative interactions with the attack framework to enhance their
safety against red teaming attacks. Extensive experiments on different LLMs
validate the effectiveness of our proposed attack and defense frameworks.
Additionally, we release a series of attack prompts datasets named SAP with
varying sizes, facilitating the safety evaluation and enhancement of more LLMs.
Our code and dataset are available at https://github.com/Aatrox103/SAP.
Comment: Accepted to EMNLP 2023 (Findings)
Rethinking Missing Data: Aleatoric Uncertainty-Aware Recommendation
Historical interactions are the default choice for recommender model
training, which typically exhibit high sparsity, i.e., most user-item pairs are
unobserved missing data. A standard choice is treating the missing data as
negative training samples and estimating interaction likelihood between
user-item pairs along with the observed interactions. In this way, some
potential interactions are inevitably mislabeled during training, which will
hurt the model fidelity, hindering the model from recalling the mislabeled items,
especially the long-tail ones. In this work, we investigate the mislabeling
issue from a new perspective of aleatoric uncertainty, which describes the
inherent randomness of missing data. The randomness pushes us to go beyond
merely the interaction likelihood and embrace aleatoric uncertainty modeling.
Towards this end, we propose a new Aleatoric Uncertainty-aware Recommendation
(AUR) framework that consists of a new uncertainty estimator along with a
normal recommender model. According to the theory of aleatoric uncertainty, we
derive a new recommendation objective to learn the estimator. As the chance of
mislabeling reflects the potential of a pair, AUR makes recommendations
according to the uncertainty, which is demonstrated to improve the
recommendation performance of less popular items without sacrificing the
overall performance. We instantiate AUR on three representative recommender
models: Matrix Factorization (MF), LightGCN, and VAE from mainstream model
architectures. Extensive results on two real-world datasets validate the
effectiveness of AUR w.r.t. better recommendation results, especially on
long-tail items.
A New Technique for Multispectral and Panchromatic Image Fusion
In this paper, a technique is presented for fusing panchromatic (PAN) and low-spatial-resolution multispectral (MS) images to increase the spatial resolution of the latter. We first apply a PCA transformation to the MS image to obtain the principal component (PC) images. An NSCT transformation is then applied to the PAN image and to each PC image for N levels of decomposition. We use FOCC as the criterion to select the PC, and relative entropy as the criterion to reconstruct the high-frequency detail images. Finally, we apply the inverse NSCT to the selected PC's low-frequency approximation image and the reconstructed high-frequency detail images to obtain the high-spatial-resolution MS image. Experimental results obtained with the proposed image fusion method indicate some improvements in fusion performance.